
This report examines a dataset relating to red variants of a Portuguese wine. Although I’m a big fan of red wine, I know very little about what constitutes a high-quality wine, and I’m eager to learn more.

Univariate Plots Section

Summary Statistics

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This dataset consists of 1,599 observations of 13 variables: a row index (X), 11 chemical measurements, and a quality rating.
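The listing above looks like the output of `names()`, `str()` and `summary()`. As a minimal sketch, the same calls can be tried on a tiny data frame rebuilt from the first ten rows that `str()` prints. The name `wine_head` is my own, and only four of the columns are included to keep it short, so the numbers will not match the full-dataset summary.

```r
# First ten rows of four columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)    # column names
str(wine_head)      # storage type of each column and a preview of the values
summary(wine_head)  # min, quartiles, mean and max for every column
```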

After looking at the summary data and this first histogram of the distribution of quality, one of the first things that I saw is that the majority of wines fall in the middle of the quality scale. What chemical characteristics do wines share on the higher end of the scale (7 or 8)?

Next I wanted to get a bit more familiar with the other variables. I did some research to learn a little more about some of the chemical properties listed so that I would know what I was looking at. I divided the variables into smaller groups, though I don’t mean to imply any specific correlations between the variables in the same groupings.

Acids

Then I adjusted the plots a little to zoom in on the data.

Next I looked at residual sugar and chlorides:

Two very long tails here!

Let’s zoom in.
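One way to zoom in on a long-tailed variable is to limit the x-axis to the bulk of the data. The following is a sketch in base R rather than the code behind the plots in this report. The 95th-percentile cutoff and the helper name `zoom_limits` are assumptions of mine, and the sample values are the ten residual sugar readings shown in the `str()` output.

```r
# Ten residual sugar values copied from the str() output earlier in the report.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Hypothetical helper: axis limits that run from the minimum up to a chosen
# quantile, so a few extreme values do not squash the rest of the histogram.
zoom_limits <- function(x, upper = 0.95) {
  c(min(x), unname(quantile(x, upper)))
}

lims <- zoom_limits(residual_sugar)

# Bin only the values inside the window so the bars and the axis agree.
in_window <- residual_sugar[residual_sugar <= lims[2]]
hist(in_window, xlim = lims,
     main = "Residual sugar (zoomed)", xlab = "residual.sugar")
```

Note that this version drops the values beyond the cutoff before binning. If the plots are drawn with ggplot2, `coord_cartesian(xlim = lims)` narrows the view without removing any observations.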

When reading up on the significance of residual sugar in wine, I learned that the amount of residual sugar determines how “dry” or “sweet” a wine is considered to be. Based on a chart I found online that characterizes wines by their residual sugar levels, I created a new variable, “sweetness”, which displays whether a wine is characterized as “Dry”, “Off Dry”, “Medium Dry”, “Medium Sweet”, “Sweet” or “Luscious”. Surprisingly, the vast majority of the wines in the dataset are “Dry” wines (95%), and the rest are “Off Dry”. No other sweetness categories were represented.

## # A tibble: 2 x 2
##   sweetness     n
##      <fctr> <int>
## 1       Dry  1515
## 2   Off Dry    84
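A categorical variable like this can be derived from a numeric one with `cut()`. The sketch below shows the idea only: the breakpoints are placeholders that I made up for illustration, in the same units as `residual.sugar`, and they are not the thresholds from the chart used in this report, so the counts will not reproduce the table above. The sample values are the ten residual sugar readings from the `str()` output.

```r
# Illustrative breakpoints. These are placeholders, NOT the values from the
# chart the report relied on; substitute the real thresholds before use.
sweetness_breaks <- c(0, 5, 20, 35, 65, 120, Inf)
sweetness_labels <- c("Dry", "Off Dry", "Medium Dry",
                      "Medium Sweet", "Sweet", "Luscious")

residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# cut() assigns each value to the interval it falls in: (0,5], (5,20], ...
sweetness <- cut(residual_sugar,
                 breaks = sweetness_breaks,
                 labels = sweetness_labels)

table(sweetness)  # count per category, including the empty ones
```

Because the result is a factor with all six levels, categories that never occur still show up with a count of zero, which makes it easy to see that only some of them are represented.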

After reading about the controversial wine additives, sulfites (and the contaminants, sulfates), I was curious to see the levels in this sample. Here’s a look at sulfites/sulfates:

Let’s zoom in…

General Properties

According to Wine Spectator, the ideal pH levels for red wine are between 3.3 and 3.6. I am also interested in seeing if there is any correlation between quality and density or alcohol content. Here is a quick look at these properties…

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in this dataset. Leaving aside the row index X, we will be looking at 12 variables: 11 are stored as decimal numbers, and ‘quality’ is stored as an integer.

What is/are the main feature(s) of interest in your dataset?

The main features of interest for me in this dataset are the acids, the residual sugar and level of sweetness, sulfites and pH, density and alcohol percentage. I’m curious about how and if these variables influence quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Including the quality variable in the analysis with other variables will be illuminating, and I’m also hoping to discover if the degree of sweetness has any effect on the perceived quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created a “sweetness” variable based on the levels of residual sugar.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Other than creating a variable for sweetness, I did not perform any operations to tidy, adjust, or change the form of the data.

Bivariate Plots Section

I’m interested in seeing if any of the variables seem to affect the quality variable. I’m also curious to see if sweetness has any correlation with quality.

First let’s look at the acid variables…

## # A tibble: 6 x 4
##   quality fixed_acid_mean fixed_acid_median     n
##     <int>           <dbl>             <dbl> <int>
## 1       3        8.360000              7.50    10
## 2       4        7.779245              7.50    53
## 3       5        8.167254              7.80   681
## 4       6        8.347179              7.90   638
## 5       7        8.872362              8.80   199
## 6       8        8.566667              8.25    18
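A per-quality summary like the table above can be built by splitting the data on quality and summarising each piece. The tibble header suggests the original used dplyr, but that is a guess on my part, so this sketch sticks to base R. It runs on the ten rows visible in the `str()` output (under the hypothetical name `wine_head`), so its numbers differ from the full-dataset table.

```r
# Ten rows copied from the str() output; `wine_head` is a made-up name.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Split the rows by quality, then compute mean, median and count per group.
groups <- split(wine_head, wine_head$quality)
by_quality <- do.call(rbind, lapply(groups, function(d) {
  data.frame(
    quality           = d$quality[1],
    fixed_acid_mean   = mean(d$fixed.acidity),
    fixed_acid_median = median(d$fixed.acidity),
    n                 = nrow(d)
  )
}))

by_quality
```

Reporting `n` next to each mean matters here: in the full table, quality levels 3 and 8 have only 10 and 18 wines, so their averages are much less stable than those for levels 5 and 6.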

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The correlation between quality and fixed acidity appears weak (0.124).
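To make the `cor.test()` output easier to read, here is a short sketch of how the t statistic and the 95 percent confidence interval follow from just the correlation and the sample size. The t statistic is r * sqrt(n - 2) / sqrt(1 - r^2), and the interval comes from the Fisher z-transformation. The variable names are my own; the inputs are the values reported above.

```r
r <- 0.1240516   # reported correlation between fixed acidity and quality
n <- 1599        # number of observations, so df = n - 2 = 1597

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval: transform r to z, step out by the normal
# quantile times the standard error, then transform back.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

With 1599 observations even a weak correlation produces a very small p-value, which is why the size of the coefficient, not the p-value, is the more useful guide to how strong each relationship is.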

## # A tibble: 6 x 4
##   quality volatile_acid_mean volatile_acid_median     n
##     <int>              <dbl>                <dbl> <int>
## 1       3          0.8845000                0.845    10
## 2       4          0.6939623                0.670    53
## 3       5          0.5770411                0.580   681
## 4       6          0.4974843                0.490   638
## 5       7          0.4039196                0.370   199
## 6       8          0.4233333                0.370    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

There is a moderately negative relationship between quality and volatile acidity.

## # A tibble: 6 x 4
##   quality citric_acid_mean citric_acid_median     n
##     <int>            <dbl>              <dbl> <int>
## 1       3        0.1710000              0.035    10
## 2       4        0.1741509              0.090    53
## 3       5        0.2436858              0.230   681
## 4       6        0.2738245              0.260   638
## 5       7        0.3751759              0.400   199
## 6       8        0.3911111              0.420    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

There is a positive (but relatively weak) relationship between quality and citric acid.

## # A tibble: 6 x 4
##   quality residual_sugar_mean residual_sugar_median     n
##     <int>               <dbl>                 <dbl> <int>
## 1       3            2.635000                   2.1    10
## 2       4            2.694340                   2.1    53
## 3       5            2.528855                   2.2   681
## 4       6            2.477194                   2.2   638
## 5       7            2.720603                   2.3   199
## 6       8            2.577778                   2.1    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

There are a lot of outliers in the 5, 6 and 7 quality range, but there is no strong correlation between quality and residual sugar.

## # A tibble: 6 x 4
##   quality chlorides_mean chlorides_median     n
##     <int>          <dbl>            <dbl> <int>
## 1       3     0.12250000           0.0905    10
## 2       4     0.09067925           0.0800    53
## 3       5     0.09273568           0.0810   681
## 4       6     0.08495611           0.0780   638
## 5       7     0.07658794           0.0730   199
## 6       8     0.06844444           0.0705    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

There are a lot of outliers in the 5 and 6 quality range, but generally there is no strong correlation between chlorides and quality.

## # A tibble: 6 x 4
##   quality free_sulfur_mean free_sulfur_median     n
##     <int>            <dbl>              <dbl> <int>
## 1       3         11.00000                6.0    10
## 2       4         12.26415               11.0    53
## 3       5         16.98385               15.0   681
## 4       6         15.71160               14.0   638
## 5       7         14.04523               11.0   199
## 6       8         13.27778                7.5    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$free.sulfur.dioxide and wine$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

There is no strong correlation between free sulfur dioxide and quality.

## # A tibble: 6 x 4
##   quality total_sulfur_mean total_sulfur_median     n
##     <int>             <dbl>               <dbl> <int>
## 1       3          24.90000                15.0    10
## 2       4          36.24528                26.0    53
## 3       5          56.51395                47.0   681
## 4       6          40.86991                35.0   638
## 5       7          35.02010                27.0   199
## 6       8          33.44444                21.5    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

There is no strong correlation between quality and total sulfur dioxide.

## # A tibble: 6 x 4
##   quality sulphates_mean sulphates_median     n
##     <int>          <dbl>            <dbl> <int>
## 1       3      0.5700000            0.545    10
## 2       4      0.5964151            0.560    53
## 3       5      0.6209692            0.580   681
## 4       6      0.6753292            0.640   638
## 5       7      0.7412563            0.740   199
## 6       8      0.7677778            0.740    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Sulfate levels in this sample are slightly higher in the quality levels above 5. There is a correlation (though relatively weak) between sulfates and quality.

## # A tibble: 6 x 4
##   quality density_mean density_median     n
##     <int>        <dbl>          <dbl> <int>
## 1       3    0.9974640       0.997565    10
## 2       4    0.9965425       0.996500    53
## 3       5    0.9971036       0.997000   681
## 4       6    0.9966151       0.996560   638
## 5       7    0.9961043       0.995770   199
## 6       8    0.9952122       0.994940    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

There is no strong correlation between density and quality.

## # A tibble: 6 x 4
##   quality  pH_mean pH_median     n
##     <int>    <dbl>     <dbl> <int>
## 1       3 3.398000      3.39    10
## 2       4 3.381509      3.37    53
## 3       5 3.304949      3.30   681
## 4       6 3.318072      3.32   638
## 5       7 3.290754      3.28   199
## 6       8 3.267222      3.23    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$pH and wine$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

Interestingly, according to Wine Spectator, the best pH for red wines is between 3.3 and 3.6. Both the mean and the median fall within that range for every quality level except the two best levels represented in this dataset, 7 and 8, where the mean and median are slightly below the desirable range. However, the correlation between pH and quality is very weak (-0.058).

## # A tibble: 6 x 4
##   quality alcohol_mean alcohol_median     n
##     <int>        <dbl>          <dbl> <int>
## 1       3     9.955000          9.925    10
## 2       4    10.265094         10.000    53
## 3       5     9.899706          9.700   681
## 4       6    10.629519         10.500   638
## 5       7    11.465913         11.500   199
## 6       8    12.094444         12.150    18

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

There seems to be a substantial increase in alcohol levels as quality increases, and a fairly strong correlation between alcohol and quality.

Examining sweetness

First I plotted the proportion of each quality level in each level of sweetness.

But it was difficult for me to tell whether ‘dry’ or ‘off dry’ wines were of better quality. So I ran a summary of the mean and median quality levels of both:

## # A tibble: 2 x 4
##   sweetness quality_mean quality_median     n
##      <fctr>        <dbl>          <dbl> <int>
## 1       Dry     5.631683              6  1515
## 2   Off Dry     5.714286              6    84

Both ‘dry’ and ‘off dry’ wines are very close in average quality (means of 5.63 and 5.71, with the same median of 6), so sweetness level does not appear to be important to quality in this dataset. This shouldn’t be too surprising after seeing how low the correlation coefficient is between quality and residual sugar.

Alcohol and density

I was reminded while doing a little research on wine components that alcohol is less dense than water. So, I wanted to plot alcohol and density, expecting to see density decrease as alcohol level increased.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Sure enough, as alcohol level increases, density decreases. There is a fairly strong negative correlation between alcohol and density (-0.496).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

My main points of note in this investigation were as follows. There is a moderate negative relationship between quality and volatile acidity (-0.391). There is a relatively weak positive relationship between quality and citric acid (0.226), and another between quality and sulphates (0.251). The best pH for red wines is said to be between 3.3 and 3.6; both the mean and the median fall within that range for every quality level except the two best levels represented in this dataset, 7 and 8, where they are slightly below the desirable range. However, the correlation between pH and quality is very weak (-0.058). There seems to be a substantial increase in alcohol levels as quality increases, and a fairly strong correlation between alcohol and quality (0.476). As alcohol level increases, density decreases, and there is a fairly strong negative correlation between the two (-0.496).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I was hoping to find an interesting correlation between sweetness and quality, but found none. I also examined alcohol and density, and found a fairly strong relationship between the two.

What was the strongest relationship you found?

The strongest relationships I found were between alcohol and density (with a correlation coefficient of -0.496) and alcohol and quality (with a correlation coefficient of 0.476).
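To compare the relationships with quality side by side, the reported coefficients can be collected and ranked by absolute value, so that negative relationships are weighed on the same footing as positive ones. This is a small sketch using the values from the `cor.test()` outputs in the bivariate section; the vector name `cor_with_quality` is my own.

```r
# Pearson correlations with quality, copied from the cor.test() outputs.
cor_with_quality <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  sulphates            =  0.2513971,
  density              = -0.1749192,
  pH                   = -0.05773139,
  alcohol              =  0.4761663
)

# Sort by magnitude so strong negative relationships are not overlooked.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)
```

Ranked this way, alcohol comes first, followed by volatile acidity, sulphates and citric acid, with residual sugar last.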

Multivariate Plots Section

Now that I’ve found which variables have the more positive relationship with quality (alcohol, sulphates and citric acid), I’m interested in seeing how those variables relate to each other, and their combined effect on quality.

Alcohol, Citric Acid and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$citric.acid
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

This plot shows that there are many instances of higher quality wines that have higher citric acid or alcohol, independent of the other variable. The correlation between alcohol and citric acid proves to be weak (correlation coefficient is .110).

Sulphates, Citric Acid and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$citric.acid
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2678558 0.3563278
## sample estimates:
##     cor 
## 0.31277

It seems as if there are many higher quality wines that are high in both citric acid and sulphates. Interestingly, there is a moderately positive relationship between citric acid and sulphates (correlation coefficient .313).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$sulphates
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04477906 0.14196454
## sample estimates:
##        cor 
## 0.09359475

It seems as if the higher quality wines are fairly spread out in this plot. There is a weak correlation between alcohol and sulphates (correlation coefficient is .094).
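The three pairwise tests above can also be summarised in a single correlation matrix with `cor()`. This sketch runs on the ten rows visible in the `str()` output (under the hypothetical name `wine_head`), so the coefficients it prints are for that tiny sample only and should not be compared with the full-dataset values reported above.

```r
# Ten rows copied from the str() output; `wine_head` is a made-up name.
wine_head <- data.frame(
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36)
)

# cor() on a data frame returns every pairwise Pearson correlation at once.
cor_matrix <- cor(wine_head)
round(cor_matrix, 2)
```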

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Of the pairs of variables that I examined here, citric acid and sulphates had the strongest relationship, while sulphates and alcohol had the weakest.

Were there any interesting or surprising interactions between features?

Although higher quality wines tend to have higher levels of alcohol and citric acid, there is a weaker relationship between alcohol and citric acid than I was expecting.


Final Plots and Summary

Plot One

Description One

This boxplot illustrates the relationship between alcohol and quality. I chose it because these two variables had the strongest correlation of the initial variables I looked at.

Plot Two

Although I was hoping to find an interesting relationship between sweetness and quality, this plot represents one of the weakest correlations that I discovered between two variables.

Plot Three

I chose this plot because it showed a surprisingly strong correlation between sulphates and citric acid. This relationship might warrant further investigation.


Reflection

I learned a great deal through the process of completing this project. I was interested in doing some background research into the chemical components of wine, of which I knew very little. It was exciting to learn that based on the level of residual sugar in the wines, I could create a new categorical variable that labeled the wines according to how dry or sweet they were. It was disappointing to find out that the vast majority of wines were dry, and that the degree of sweetness had no significant correlation to the quality of the wine.

However, it was interesting to see that levels of alcohol, citric acid and sulphates did have a positive relationship with quality, and to also confirm that alcohol and density would have a strong correlation with each other. It was surprising to find that sulphates and citric acid had a moderately positive correlation.

I enjoyed this project because it showed me how doing EDA can lead to better and more informed questions about your dataset. I found some interesting relationships, but I know I would have to dig deeper in order to draw any conclusions.

In the future, it would be interesting to amend the dataset to include a wider sample and range of variables such as residual sugar levels, as well as other factors such as where the grapes used to make the wine were from, what year the wine was created, etc. I think that would be helpful for drawing broader conclusions about which variables influence quality.